Kickstarter dataset project - Yasmine Maricar

Description

Reading and cleaning up the dataframe

I am omitting this step for the sake of clarity. It would read the original dataset, clean it up, and add relevant features where the original ones are not usable.

Going from the final dataframe

Let's drop these rows: they have 0 backers, no country and no usd pledged, which looks like a mistake made when the data was collected.
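As a minimal sketch of this kind of filter, assuming the columns are called backers, country and usd_pledged and that missing values are stored as NaN (both are assumptions, the real dataframe may encode them differently):

```python
import numpy as np
import pandas as pd

# Toy frame; column names and values are illustrative, not the real dataset.
df = pd.DataFrame({
    "name": ["A", "B", "C", "D"],
    "backers": [12, 0, 0, 3],
    "country": ["US", None, "GB", "FR"],
    "usd_pledged": [150.0, np.nan, 0.0, 40.0],
})

# Flag rows that combine all three symptoms: no backers, no country, no pledged amount.
suspicious = (df["backers"] == 0) & df["country"].isna() & df["usd_pledged"].isna()

# Keep everything else and rebuild a clean 0..n-1 index.
cleaned = df.loc[~suspicious].reset_index(drop=True)
print(f"dropped {suspicious.sum()} of {len(df)} rows")
```

Requiring all three conditions at once keeps legitimate projects that simply attracted no backers (row C in the toy frame).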

I'll leave the duplicates as they are, but it's interesting to see that some seem genuine, others seem to be the same project revamped or relaunched, and others are another rendition of the same project (a play at the theater and a video, for instance).

It would be interesting to know more about the motives and mindset of people creating these projects again (a new need for funds?). Are there also cases of past successful projects being rebooted (hoaxes?)

Overall, these projects can still be included in our model, as we want to predict the success or failure of a campaign regardless.
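One way to look at such duplicates, sketched on toy data. The column names (name, main_category, launched) are assumptions, and counting categories per repeated name is only a rough heuristic for telling a relaunch from a different rendition:

```python
import pandas as pd

# Toy frame; names, categories and dates are made up for illustration.
df = pd.DataFrame({
    "name": ["Robot Kit", "Robot Kit", "Hamlet", "Hamlet", "Solo"],
    "main_category": ["Technology", "Technology", "Theater", "Film & Video", "Music"],
    "launched": pd.to_datetime(
        ["2014-01-05", "2015-03-01", "2013-06-01", "2013-09-15", "2016-02-02"]
    ),
})

# keep=False marks every occurrence of a repeated name, not only the later ones.
dupes = df[df.duplicated(subset="name", keep=False)].sort_values(["name", "launched"])

# Distinct categories per repeated name: 1 hints at a relaunch, more than 1 at another rendition.
n_categories = dupes.groupby("name")["main_category"].nunique()
print(dupes)
print(n_categories)
```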

Distribution of goals and pledges

We take the log to better see the distributions as we have outliers in both cases.
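A small sketch of the transformation, on toy values. It uses np.log1p (log of 1 + x) rather than a plain log so that pledges of 0 stay finite; that choice and the column names usd_goal and usd_pledged are assumptions, not necessarily what was run here:

```python
import numpy as np
import pandas as pd

# Toy values with one extreme goal, to mimic the effect of outliers.
df = pd.DataFrame({
    "usd_goal": [500, 1000, 2500, 5000, 1_000_000],
    "usd_pledged": [0, 120, 3000, 5200, 10],
})

# log1p(x) = log(1 + x): defined at 0, close to log(x) for large x.
log_df = np.log1p(df[["usd_goal", "usd_pledged"]])

# Skewness before and after, as a quick numeric check of the effect.
skew_raw = df["usd_goal"].skew()
skew_log = log_df["usd_goal"].skew()
print(f"skew raw={skew_raw:.2f}, skew log={skew_log:.2f}")

# The histograms can then be drawn from log_df, e.g. with log_df.hist(bins=30).
```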

Based on the above histogram, it seems that failed projects tend to have higher goals.

Feature engineering

Variables for the logistic regression:

Others

to predict target variable state

I. Let's prepare the dataset to train the model

usd_goal is skewed: let's check its distribution here, then replace it.

1. Generating html report with pandas profiling

2. Explore manually

With a 60/40 ratio between the two classes, we can consider the dataset reasonably balanced.
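The class ratio can be read directly from the target column. A minimal sketch on a toy series built to have a 60/40 split (the labels "failed" and "successful" are assumed values of state):

```python
import pandas as pd

# Toy target column; the real one would be df["state"].
state = pd.Series(["failed"] * 6 + ["successful"] * 4, name="state")

# normalize=True returns proportions instead of counts.
ratio = state.value_counts(normalize=True)
print(ratio)
```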

II. Model training

Preprocessing

Without sklearn pipelines

Let's try cross-validation
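A hedged sketch of 5-fold cross-validation for a logistic regression, on synthetic data. Note that this example wraps the scaler and the model in a scikit-learn pipeline so that scaling is refit inside each fold; that is a choice of this sketch, and the number of folds and the scoring metric are assumptions too:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data standing in for the prepared features and the binary target.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)

# Scaling lives inside the pipeline, so each fold only sees its own training statistics.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# One accuracy score per fold.
scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
print(f"accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```

Reporting the spread across folds alongside the mean gives a sense of how stable the estimate is.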

We can then use model.predict_proba(x_test)[:,1] to get, for each sample, the predicted probability of the positive label.
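To illustrate on synthetic data (the features, split size and 0.5 threshold are assumptions of this sketch): predict_proba returns one column per class, ordered as in model.classes_, so column 1 is the second class listed there.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic binary problem standing in for the real features and target.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] > 0).astype(int)

x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)
model = LogisticRegression().fit(x_train, y_train)

# Columns follow model.classes_ (here [0, 1]), so [:, 1] is P(label == 1).
print(model.classes_)
proba_positive = model.predict_proba(x_test)[:, 1]

# A custom threshold can then be applied instead of the default decision rule.
predicted = (proba_positive >= 0.5).astype(int)
```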

Conclusion

3) From what we have observed, mostly through EDA (I didn't leave all my code for this part here), it seems better to do a project in:

Furthermore, it seems that projects with a duration below one month have better chances of success.
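A sketch of how such a comparison can be computed, on toy data. The column names launched, deadline and state, and the 30-day cutoff for "one month", are assumptions; the toy values are only there to make the code run and say nothing about the real dataset:

```python
import pandas as pd

# Toy campaigns; dates and outcomes are made up.
df = pd.DataFrame({
    "launched": pd.to_datetime(["2015-01-01", "2015-01-01", "2015-02-01", "2015-03-01"]),
    "deadline": pd.to_datetime(["2015-01-21", "2015-03-02", "2015-02-15", "2015-04-30"]),
    "state": ["successful", "failed", "successful", "failed"],
})

# Campaign length in whole days.
df["duration_days"] = (df["deadline"] - df["launched"]).dt.days
df["short_campaign"] = df["duration_days"] < 30

# Share of successful projects among short and long campaigns.
success_rate = (df["state"] == "successful").groupby(df["short_campaign"]).mean()
print(success_rate)
```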

I think our study is incomplete because we are not studying the interactions of potential creators and backers with the project. The comments and the number of shares across the web are what make a Kickstarter project aiming for a reasonably high amount of money succeed, by targeting the right people and generating contributions within the allotted timeline.

We can see that, among the most successful categories, the mean usd_goal differs between failed and successful projects: failed projects tend to have higher goals. Keeping the goal similar to that of previously successful projects in the same domain should therefore improve the chances of success.
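The comparison of mean goals per category and outcome can be sketched as a pivot table. The column names main_category, state and usd_goal are assumptions, and the numbers below are toy values, not results from the dataset:

```python
import pandas as pd

# Toy projects; categories, outcomes and goals are made up.
df = pd.DataFrame({
    "main_category": ["Music", "Music", "Music", "Games", "Games", "Games"],
    "state": ["successful", "failed", "successful", "failed", "successful", "failed"],
    "usd_goal": [2000, 9000, 3000, 50000, 8000, 30000],
})

# One row per category, one column per outcome, mean goal in each cell.
mean_goal = df.pivot_table(
    index="main_category", columns="state", values="usd_goal", aggfunc="mean"
)
print(mean_goal)
```

Since goals are heavily skewed, swapping aggfunc="mean" for "median" may give a more robust picture.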

The factors of success of a project go far beyond what we have in the dataset for this study, as the real issue seems to be how people find these projects. Kickstarter is above all the platform that hosts the campaign and receives the funds. However, it is interesting to see that we were able to draw some insights and end up with a final model that has around 68% accuracy.